Identifying Authorship by Byte-Level N-Grams: The Source Code Author Profile (SCAP) Method
Abstract
Source code author identification is the task of identifying the most likely author of a computer program from a predefined set of candidate authors. Digital evidence of this kind plays a role in investigation and adjudication in several scenarios, such as code authorship disputes, intellectual property infringement, and tracing the source of code left on a system after a cyber attack. As in any identification task, the disputed program is compared with undisputed, known programming samples from the candidate authors.

We present a new approach, called SCAP (Source Code Author Profiles), based on byte-level n-gram profiles that represent a source code author’s style. SCAP extends a method originally applied to authorship attribution of natural language text; we show that an n-gram approach also suits the characteristics of source code. The extension consists of a simplified profile and a less complicated, but more effective, similarity measure. Experiments on data sets in different programming languages (Java or C++), with and without comments, demonstrate the effectiveness of these extensions. The SCAP approach is programming-language independent. Moreover, it deals surprisingly well with cases where only a limited number of very short programs per programmer is available for training. Finally, SCAP remains effective even in the absence of comments in the source code, a condition usually met in cyber-crime cases.

1. The Forensic Significance of Source Code

Nowadays, in a wide variety of legal cases it is important to identify the author of a usually limited piece of programming code. Such situations include cyber attacks in the form of viruses, Trojan horses, logic bombs, fraud, and credit card cloning, as well as code authorship disputes and intellectual property infringement.
Identifying the authorship of malicious or stolen source code in a reliable way has become a primary goal for digital investigators (Spafford and Weeber 1993). Please see Appendix 1 for a legal analysis of the forensic significance of source code.
Similar resources
N-gram-based Author Profiles for Authorship Attribution
We present a novel method for computer-assisted authorship attribution based on character-level n-gram author profiles, which is motivated by an almost-forgotten, pioneering method in 1976. The existing approaches to automated authorship attribution implicitly build author profiles as vectors of feature weights, as language models, or similar. Our approach is based on byte-level n-grams, it is l...
Identifying Authorship by Byte-Level N-Grams: The Source Code Author Profile (SCAP) Method
Source code author identification deals with identifying the most likely author of a computer program, given a set of predefined author candidates. There are several scenarios where digital evidence of this kind plays a role in investigation and adjudication, such as code authorship disputes, intellectual property infringement, tracing the source of code left in the system after a cyber attack,...
CNG Method with Weighted Voting
CNG Method for Authorship Attribution. The Common N-Grams (CNG) classification method for authorship attribution (AATT) was described in [2]. The method is based on extracting the most frequent byte n-grams of size n from the training data. The n-grams are sorted by their normalized frequency, and the first L most-frequent n-grams define an author profile. Given a test document, the test profil...
Comparing techniques for authorship attribution of source code
Attributing authorship of documents with unknown creators has been studied extensively for natural language text such as essays and literature, but less so for non-natural languages such as computer source code. Previous attempts at attributing authorship of source code can be categorised by two attributes: the software features used for the classification, either strings of n tokens/bytes (n-g...
Serbian Text Categorization Using Byte Level n-Grams
This paper presents the results of classifying Serbian text documents using the byte-level n-gram based frequency statistics technique, employing four different dissimilarity measures. Results show that the byte-level n-grams text categorization, although very simple and language independent, achieves very good accuracy.
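The profile construction mentioned in the CNG entry above (extract byte n-grams from the training data, sort them by normalized frequency, keep the first L) can be sketched roughly as follows. This is a minimal illustration, not the implementation from any of the listed papers: the function names `byte_ngrams`, `build_profile`, and `shared_ngrams` are hypothetical, the defaults `n=3` and `L=100` are arbitrary, and `shared_ngrams` is only a simple stand-in for comparing two profiles, since the listing truncates before the actual comparison step is described.

```python
from collections import Counter


def byte_ngrams(data: bytes, n: int):
    """Yield every contiguous run of n bytes in data."""
    for i in range(len(data) - n + 1):
        yield data[i:i + n]


def build_profile(samples, n=3, L=100):
    """Return the L most frequent byte n-grams with normalized frequencies.

    samples: iterable of bytes objects attributed to a single author.
    n and L are illustrative defaults, not values taken from the papers.
    """
    counts = Counter()
    for sample in samples:
        counts.update(byte_ngrams(sample, n))
    total = sum(counts.values())
    if total == 0:
        return {}
    # most_common(L) returns the L highest counts in descending order.
    return {gram: count / total for gram, count in counts.most_common(L)}


def shared_ngrams(profile_a, profile_b):
    """Count n-grams present in both profiles.

    Assumption: a simple overlap count used here only for illustration;
    it is not presented as the measure used by CNG.
    """
    return len(profile_a.keys() & profile_b.keys())
```

A possible use is to build one profile per candidate author from their known samples, build a profile for the disputed program with the same `n` and `L`, and compare it against each candidate profile. Because the input is raw bytes, the same code applies unchanged to source code in any programming language or to natural language text.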